[Performance] Sequential onloading#1263

Merged
dsikka merged 43 commits intomainfrom
kylesayrs/sequential-onloading
Jun 17, 2025

Conversation

@kylesayrs
Collaborator

@kylesayrs kylesayrs commented Mar 18, 2025

Sequential Onloading

Screenshot 2025-06-05 at 22 53 01

```
(25/33): Calibrating:   0%|                                                                                                                                  | 0/512 [00:00<?, ?it/s]
<class 'transformers.models.llama.modeling_llama.LlamaRMSNorm'>.weight -> cuda
<class 'torch.nn.modules.linear.Linear'>.weight -> cuda
<class 'torch.nn.modules.linear.Linear'>.weight_scale -> cuda
<class 'torch.nn.modules.linear.Linear'>.weight_zero_point -> cuda
...
(25/33): Calibrating: 100%|█████| 512/512 [00:23<00:00, 21.91it/s]
2025-06-03T17:29:15.536963-0400 | compress_modules | INFO - Quantizing model.layers.24.self_attn.q_proj using 512 samples
2025-06-03T17:29:17.328720-0400 | compress | METRIC - time 1.79s
2025-06-03T17:29:17.329265-0400 | compress | METRIC - error 8948.54
2025-06-03T17:29:17.329781-0400 | compress | METRIC - GPU 0 | usage: 5.41% | total memory: 85 GB
2025-06-03T17:29:17.330248-0400 | compress | METRIC - Compressed module size: 33.947648 MB
...
(25/33): Propagating: 100%|█████| 512/512 [00:03<00:00, 131.16it/s]
<class 'transformers.models.llama.modeling_llama.LlamaRMSNorm'>.weight -> meta
<class 'torch.nn.modules.linear.Linear'>.weight -> meta
<class 'torch.nn.modules.linear.Linear'>.weight_scale -> meta
<class 'torch.nn.modules.linear.Linear'>.weight_zero_point -> meta
...
```

Purpose

  • Reduce hardware requirements for calibrating large models
  • Reduce runtime caused by excess device movement when calibrating offloaded models

Prerequisites

  • vllm-project/compressed-tensors#354
  • vllm-project/compressed-tensors#355
  • vllm-project/compressed-tensors#356
  • vllm-project/compressed-tensors#357

Related Issues

  • Resolves #1383
  • Resolves #1228
  • Resolves #1122
  • Resolves #1078
  • Resolves #1216
  • Resolves #1483

Changes

  • Keep layer parameters onloaded during the entire sequential calibration + compression + propagation step
    • This is achieved through the keep_onload_context, which disables offloading until the context is exited
  • Dispatch model within each calibration pipeline
    • The sequential pipeline offloads the model to CPU and executes on the first CUDA device
  • Use the sequential pipeline as the default pipeline (the basic pipeline is never used)
    • Deprecate passing sequential_targets via modifiers; instead, prefer passing it as a oneshot argument
  • Dispatch model before sample generation
    • The model is dispatched exactly as it would be if it were loaded with device_map="auto"
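The layer-by-layer flow described above can be sketched as a toy loop. This is a hedged illustration only: the names onload, offload, and keep_onloaded are hypothetical stand-ins, not the actual llm-compressor API, and no real devices are involved.

```python
from contextlib import contextmanager

# Event log so we can observe the order of device movements (toy only).
events = []

def onload(layer):
    # Stand-in for moving one layer's parameters onto the execution device.
    events.append(("onload", layer))

def offload(layer):
    # Stand-in for moving the layer's parameters back off-device.
    events.append(("offload", layer))

@contextmanager
def keep_onloaded(layer):
    # Keep one layer's parameters on the execution device for the whole
    # calibrate -> compress -> propagate step; offload only on exit.
    onload(layer)
    try:
        yield
    finally:
        offload(layer)

def run_sequential(layers):
    for layer in layers:
        with keep_onloaded(layer):
            events.append(("calibrate", layer))
            events.append(("compress", layer))
            events.append(("propagate", layer))

run_sequential(["layers.0", "layers.1"])
```

Each layer is moved on and off the device exactly once, rather than once per calibration/compression/propagation sub-step.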

Examples

  • Models are loaded onto CPU before oneshot (rather than being dispatched across GPUs)
  • The model is reloaded from disk in order to redispatch onto the "auto" device map
    • In my opinion, this is a better flow anyway, since models can raise errors / take a very long time during generation, which can cause the entire compression job to go to waste
    • The alternative is to either call accelerate.remove_hooks(model) and accelerate.dispatch_model(model) before generating, or get rid of sample generation entirely. One of these may be required if compressed_linear isn't reliable enough to add to our examples
New example script

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.transformers import oneshot
from llmcompressor.utils.dev import dispatch_for_generation

# Load model (on cpu)
model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")  # model is loaded on cpu
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Define recipe
recipe = GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])

# Apply oneshot (model execution device is set to cuda, model stays on cpu)
oneshot(
    model=model,
    dataset="ultrachat_200k",
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
)

# Perform sample generation
print("\n\n")
print("========== SAMPLE GENERATION ==============")
dispatch_for_generation(model)
input_ids = tokenizer("Hello my name is", return_tensors="pt").input_ids.to("cuda")
output = model.generate(input_ids, max_new_tokens=100)
print(tokenizer.decode(output[0]))
print("==========================================\n\n")

# Save to disk in compressed format
SAVE_DIR = model_id.split("/")[1] + "-W4A16-G128"
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```

Testing

  • Calibrated and GPTQ-compressed one layer of Deepseek-V3 with a single H100 in 50 seconds
    • 4.5x improvement over the original 236 seconds
    • Peak memory of ~40 GB, which can be further reduced by increasing the granularity of sequential targets
  • Not offloading activations did not result in a performance improvement
  • TODO: Test that all example models can be reloaded and run
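Per the deprecation noted in the Changes above, the "granularity of sequential targets" knob would be controlled through the oneshot argument rather than through modifiers. A hedged fragment, reusing the model/recipe from the example script; the target module class names here are hypothetical:

```python
# Hypothetical: finer-grained sequential targets lower peak memory by
# onloading smaller units at a time (argument name as described in this PR).
oneshot(
    model=model,
    dataset="ultrachat_200k",
    recipe=recipe,
    sequential_targets=["LlamaAttention", "LlamaMLP"],  # instead of whole decoder layers
    max_seq_length=2048,
    num_calibration_samples=512,
)
```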

@github-actions

👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review.

Note: This is required to complete the testing suite, please only add the label once the PR is code complete and local testing has been performed.

@kylesayrs kylesayrs added the ready When a PR is ready for review label Mar 18, 2025
@kylesayrs kylesayrs self-assigned this Mar 18, 2025
@brian-dellabetta brian-dellabetta self-requested a review March 18, 2025 14:17
Collaborator

@brian-dellabetta brian-dellabetta left a comment


sorry, i approved this thinking it was the one-liner removing clear-ml, will have to take a closer look

@brian-dellabetta brian-dellabetta dismissed their stale review March 18, 2025 14:20


Collaborator

@brian-dellabetta brian-dellabetta left a comment


I am understanding this for the most part -- very cool!

@kylesayrs
Collaborator Author

```python
@contextmanager
def DoNotOffloadContext():
    to_offload = set()

    def patched(hook, module, output=None):
        to_offload.add(module)  # record instead of offloading now
        return output

    with patch_attr(AlignDevicesHook, "post_forward", patched):
        yield

    # offload on exit
    for module in to_offload:
        module._hf_hook.post_forward(module, None)


for subgraph in subgraphs():
    with DoNotOffloadContext():
        subgraph(**inputs)
```
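The deferred-offload idea in the sketch above can be exercised with plain-Python stand-ins. Here Hook is a toy substitute for accelerate's AlignDevicesHook, keep_onload_context is hypothetical, and no real devices or accelerate imports are involved:

```python
from contextlib import contextmanager
from unittest.mock import patch

class Hook:
    """Toy stand-in for accelerate's AlignDevicesHook."""

    def __init__(self):
        self.offloaded = []

    def post_forward(self, module, output=None):
        # In the real hook, this would move the module's weights off-device.
        self.offloaded.append(module)
        return output

hook = Hook()

@contextmanager
def keep_onload_context():
    to_offload = []

    def record_only(self, module, output=None):
        if module not in to_offload:
            to_offload.append(module)  # defer the offload instead of doing it now
        return output

    # While the context is active, post_forward only records modules.
    with patch.object(Hook, "post_forward", record_only):
        yield
    for module in to_offload:
        hook.post_forward(module)  # offload once, after the context exits

with keep_onload_context():
    hook.post_forward("model.layers.24")  # called once per forward pass...
    hook.post_forward("model.layers.24")  # ...but the offload is deferred

assert hook.offloaded == ["model.layers.24"]  # offloaded exactly once, on exit
```

Repeated forward passes inside the context no longer trigger repeated device movement; the single offload happens when the sequential step finishes.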

@kylesayrs kylesayrs force-pushed the kylesayrs/sequential-onloading branch from 382c3e6 to 7586733 Compare June 3, 2025 21:31
@kylesayrs kylesayrs removed the ready When a PR is ready for review label Jun 3, 2025
@kylesayrs kylesayrs added the ready When a PR is ready for review label Jun 16, 2025
Collaborator

@brian-dellabetta brian-dellabetta left a comment


awesome stuff!


Collaborator

@brian-dellabetta brian-dellabetta left a comment


pretty sweet that we're able to remove ~500 lines of code while adding a huge feature like this 🔥

Screenshot 2025-06-17 at 1 50 41 PM

Collaborator

@dsikka dsikka left a comment


Great work

@dsikka dsikka enabled auto-merge (squash) June 17, 2025 19:58
@dsikka dsikka merged commit f4e484d into main Jun 17, 2025
18 checks passed
@dsikka dsikka deleted the kylesayrs/sequential-onloading branch June 17, 2025 20:45
dsikka added a commit that referenced this pull request Jun 25, 2025
## Purpose ##
* Speed up tests by reducing device movement

## Background ##
As of #1263, the model is dispatched to different device maps depending
on which pipelines are used. If the model starts on anything but the
CPU, then these dispatches and undispatches create device movement.
Starting on the CPU will ensure no device movement occurs when offloaded
dispatches happen.

Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
Co-authored-by: Dipika Sikka <dipikasikka1@gmail.com>
kylesayrs added a commit that referenced this pull request Jun 30, 2025
aireilly pushed a commit to aireilly/llm-compressor that referenced this pull request Jul 30, 2025
…ssed (vllm-project#1530)

## Purpose ##
* Fix failing examples

## Changes ##
* Save model after generation in all examples
* Previously, models were saved before generation, causing generation to
fail because we do not yet fully support generating with compressed
models

## Future ##
* In the future, we can define a better API around compressing and
decompressing models which does not require so many arguments
* In the future, we can standardize around reloading (and redispatching)
the model before generation, as suggested here vllm-project#1263
* In the future, we can remove the sample generation step

Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
aireilly pushed a commit to aireilly/llm-compressor that referenced this pull request Jul 30, 2025
aireilly pushed a commit to aireilly/llm-compressor that referenced this pull request Jul 30, 2025